20 research outputs found

    NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding.

    Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in MR have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems, in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced named entity analysis tool for biomedicine: (a) a new Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf automated named entity recognition (NER) extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

    Very little information is shared across multiple biomedical terminologies.

    (A) The panel on the left illustrates the overlap among the concepts annotated by the terminologies documenting Diseases and Syndromes. The figure itself is composed of ten concentric rings, with the outermost ring (k = 1) indicating the colors assigned to each dataset. The next ring (k = 2) displays the overlap in concepts among all pairwise comparisons, arranged in clockwise order starting with the intersection (MSH, NCI). The extent of overlap was computed by dividing the number of co-occurring annotations by the maximum possible number given the sizes of the terminologies being intersected (percent maximum overlap). This information is displayed within the concentric ring using bi-colored bars, whose heights depict the percent maximum overlap for the terminologies indicated by the colors. The panels on the right illustrate this idea by enlarging a section of the original figure, highlighting a particular intersection (NCI, CHV), and explaining how the colored bar translates into the percent maximum overlap. The remaining concentric rings (k = 3…10) display the extent of overlap for all higher-order intersections (3-way, 4-way, etc.), with each ring containing colored bars. (B) This figure illustrates the overlap among terms annotated to each concept for the same ten datasets depicted in (A). (C, D) These panels show the overlap in concepts (C) and terms (D) for the Pharmacological Substances terminologies. Note that only the ten largest datasets were included in each panel for the sake of clarity.
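The percent maximum overlap described in this caption can be sketched in a few lines of Python. This is an illustrative reading, not the authors' code: it assumes that the "maximum possible number" of shared annotations equals the size of the smallest terminology in the intersection. The function name and the toy concept identifiers are hypothetical.

```python
def percent_maximum_overlap(*terminologies):
    """Percent maximum overlap for a k-way intersection (k >= 2).

    Assumption: the largest possible intersection is bounded by the
    smallest terminology, so that size is used as the denominator.
    """
    sets = [set(t) for t in terminologies]
    if len(sets) < 2:
        raise ValueError("need at least two terminologies to intersect")
    max_possible = min(len(s) for s in sets)
    if max_possible == 0:
        return 0.0  # an empty terminology cannot share any annotation
    shared = set.intersection(*sets)
    return 100.0 * len(shared) / max_possible


# Toy concept identifiers (hypothetical, for illustration only).
msh = {"C1", "C2", "C3", "C4"}
nci = {"C3", "C4", "C5"}
chv = {"C4", "C9"}

pairwise = percent_maximum_overlap(msh, nci)       # 2 shared / 3 possible
three_way = percent_maximum_overlap(msh, nci, chv)  # 1 shared / 2 possible
```

The same function covers the pairwise ring (k = 2) and the higher-order rings, since it accepts any number of terminologies.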

    Undocumented, general-English headwords and near-synonyms can be acquired experimentally.

    (A) The distribution over the inferred accuracies of the annotators validating harvested synonyms. (B) The true positive rate (blue) and false discovery rate (red) of the validation process as a function of the posterior probability of annotation accuracy. Diagnostic statistics were computed using known and random pairings. (C) The receiver operating characteristic (ROC) curve for the statistical model of the validation process, computed using known and random pairings. (D) The distribution over the posterior log-odds in favor of annotation accuracy for the novel synonym-headword pairings, annotated with exemplar pairings (rejected in red and accepted in blue). (E) The distributions over semantic similarity scores for the true negative (red), true positive (green), and novel synonym pairs (blue). (F) Bootstrapped (10,000 re-samples) distributions over the average semantic similarity scores for each group of pairings, computed using the data depicted in (E).
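As a rough illustration of the bootstrapping mentioned in panel (F), the sketch below resamples a list of similarity scores with replacement and records the mean of each resample. It is a generic nonparametric bootstrap under stated assumptions, not the authors' procedure; the function name, the seed, and the example scores are hypothetical.

```python
import random


def bootstrap_means(scores, n_resamples=10_000, seed=0):
    """Return the mean of each bootstrap resample of `scores`.

    Each resample draws len(scores) values with replacement, so the
    returned list approximates the sampling distribution of the mean.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(scores)
    return [sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)]


# Hypothetical semantic similarity scores for one group of pairings.
novel_pair_scores = [0.42, 0.55, 0.61, 0.38, 0.70, 0.49]
means = bootstrap_means(novel_pair_scores, n_resamples=1_000)
```

Running this once per group (true negative, true positive, novel pairs) would give one distribution of average scores per group, which is the kind of comparison the panel describes.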

    Quantifying the Impact and Extent of Undocumented Biomedical Synonymy

    Synonymous relationships among biomedical terms are extensively annotated within specialized terminologies, implying that synonymy is important for practical computational applications within this field. It remains unclear, however, whether text mining actually benefits from documented synonymy and whether existing biomedical thesauri provide adequate coverage of these linguistic relationships. In this study, we examine the impact and extent of undocumented synonymy within a very large compendium of biomedical thesauri. First, we demonstrate that missing synonymy has a significant negative impact on named entity normalization, an important problem within the field of biomedical text mining. To estimate the amount of synonymy currently missing from thesauri, we develop a probabilistic model for the construction of synonym terminologies that is capable of handling a wide range of potential biases, and we evaluate its performance using the broader domain of near-synonymy among general English words. Our model predicts that over 90% of these relationships are currently undocumented, a result that we support experimentally through "crowd-sourcing." Finally, we apply our model to biomedical terminologies and predict that they are missing the vast majority (>90%) of the synonymous relationships they intend to document. Overall, our results expose the dramatic incompleteness of current biomedical thesauri and suggest the need for "next-generation," high-coverage lexical terminologies.

    Environmental and state-level regulatory factors affect the incidence of autism and intellectual disability.

    Many factors affect the risks for neurodevelopmental maladies such as autism spectrum disorders (ASD) and intellectual disability (ID). To compare environmental, phenotypic, socioeconomic and state-policy factors in a unified geospatial framework, we analyzed the spatial incidence patterns of ASD and ID using an insurance claims dataset covering nearly one third of the US population. Following epidemiologic evidence, we used the rate of congenital malformations of the reproductive system as a surrogate for environmental exposure of parents to unmeasured developmental risk factors, including toxins. Adjusted for gender, ethnic, socioeconomic, and geopolitical factors, the ASD incidence rates were strongly linked to population-normalized rates of congenital malformations of the reproductive system in males (an increase in ASD incidence by 283% for every percent increase in incidence of malformations, 95% CI: [91%, 576%], p < 6×10⁻⁵). Such congenital malformations were barely significant for ID (94% increase, 95% CI: [1%, 250%], p = 0.0384). Other congenital malformations in males (excluding those affecting the reproductive system) appeared to significantly affect both phenotypes: 31.8% ASD rate increase (CI: [12%, 52%], p < 6×10⁻⁵), and 43% ID rate increase (CI: [23%, 67%], p < 6×10⁻⁵). Furthermore, the state-mandated rigor of diagnosis of ASD by a pediatrician or clinician for consideration in the special education system was predictive of a considerable decrease in ASD and ID incidence rates (98.6%, CI: [28%, 99.99%], p = 0.02475 and 99% CI: [68%, 99.99%], p = 0.00637 respectively). Thus, the observed spatial variability of both ID and ASD rates is associated with environmental and state-level regulatory factors; the magnitude of influence of compound environmental predictors was approximately three times greater than that of state-level incentives. The estimated county-level random effects exhibited marked spatial clustering, strongly indicating existence of as yet unidentified localized factors driving apparent disease incidence. Finally, we found that the rates of ASD and ID at the county level were weakly but significantly correlated (Pearson product-moment correlation 0.0589, p = 0.00101), while for females the correlation was much stronger (0.197, p < 2.26×10⁻¹⁶).

    Most near-synonymous relationships among general English words are undocumented.

    The overlap among the (A) headwords and (B) synonymous relationships annotated within nine general-English thesauri. (C) The number of known (above x-axis) and undocumented (below x-axis) headwords belonging to each of the ten headword-specific mixture model components (see Supporting Information Text S1, http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003799#pcbi.1003799.s014). (D) The number of known (above x-axis) and undocumented (below x-axis) synonymous relationships belonging to each mixture component. The blue bars indicate undocumented relationships paired to known headwords, while the red bars indicate undocumented relationships paired to latent headwords. (E) The number of synonymous relationships is shown as a function of the total number of headwords in the English language. The width of the line indicates the 99% confidence interval for the estimate (see Supporting Information Text S1). (F) The distribution over the number of synonyms annotated per headword (gray) is compared to the theoretical distribution obtained using the best-fitting statistical annotation model (blue). The R² value indicates the fraction of variance in synonym number explained by the model. For reference, log-Gaussian and geometric models were fit to the data as well (red and green, respectively), although their quality of fit was several thousand orders of magnitude worse than the best-fitting annotation model (according to marginal likelihood). (G) Box-whisker plots depicting the mean relative word frequencies (1,000 bootstrapped re-samples) for each of the ten headword-specific mixture components. For reference, the probability of headword annotation, marginalized over all possible synonym pairs, is plotted in green. (H) The three curves indicate the expected fraction of undocumented synonymy that would be discovered upon repeatedly and independently constructing additional lexical resources (x-axis) identical to the complete dataset (blue), WordNet only (red), and WordNet plus Webster's New World (green).

    Biomedical terminologies are likely missing the vast majority of domain-specific, synonymous relationships.

    The numbers of undocumented concepts and synonyms specific to each biomedical sub-domain were estimated using a hierarchical mixture model in order to capture annotation variability that occurred within and across terminologies (10 concept components, each with 4 synonym components; see Materials and Methods, http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003799#s4, and Supporting Information Text S1, http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003799#pcbi.1003799.s014). In panels (A) and (B), the number of documented concepts per component (green, above x-axis) is compared to the estimated number of undocumented concepts per component (blue, below x-axis): (A) Diseases and Syndromes and (B) Pharmacological Substances. In panels (C) and (D), the number of documented synonyms per mixture component (green, above x-axis) is compared to the estimated number of undocumented synonyms, which come in two flavors: undocumented synonyms paired to documented concepts (blue, below x-axis) and undocumented synonyms paired to undocumented concepts (red, below x-axis).